Learning Outcomes


* Understand the concept of data science pipelines that encompass data collection,
preprocessing, feature engineering, and analysis.

* Understand the difference between data science pipelines and machine learning
pipelines.

* Understand the concept of exploratory data analysis (EDA) to gain insights into data
distributions, correlations, and patterns.

* Understand the importance of selecting appropriate evaluation metrics to assess the
performance of machine learning models within pipelines.

* Understand the principles of automation, continuous integration, and continuous
deployment (CI/CD) in the context of data science and machine learning.

* Demonstrate the ability to present and communicate insights gained from data science 
and machine learning pipelines to both technical and non-technical audiences. 


Pre-lecture quiz


What is data?

What is data preprocessing?

What is the difference between data preprocessing and data
processing?

What is a machine learning model?


Answers 


* What is data?

* Data refers to a collection of raw facts, figures, observations, or
measurements that represent information about a specific subject
or context. It can be in various forms such as text, numbers, images,
or videos.


* What is data preprocessing?

* Data preprocessing involves the steps taken to clean, transform, and
organize raw data into a format that is suitable for analysis and
modeling. It includes tasks such as handling missing values,
removing duplicates, and scaling features.
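
The three preprocessing tasks named above can be sketched with pandas. This is a minimal illustration, not a prescribed method: the `raw` table, its column names, and the choices of median imputation and min-max scaling are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical raw records: one exact duplicate row and one missing value.
raw = pd.DataFrame(
    {
        "age": [25.0, 32.0, None, 32.0, 47.0],
        "income": [40000.0, 52000.0, 61000.0, 52000.0, 88000.0],
    }
)


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Remove duplicates, fill missing values, and min-max scale each column."""
    # 1. Remove exact duplicate rows.
    cleaned = df.drop_duplicates().reset_index(drop=True)
    # 2. Handle missing values by filling each column with its median
    #    (an assumed strategy; dropping rows is another common choice).
    cleaned = cleaned.fillna(cleaned.median())
    # 3. Scale every feature to the [0, 1] range (min-max scaling).
    return (cleaned - cleaned.min()) / (cleaned.max() - cleaned.min())


processed = preprocess(raw)
print(processed)
```

Which imputation and scaling strategy is appropriate depends on the data set and on the model that will consume it.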


Answers 


* What is the difference between data preprocessing and data processing?

* Data preprocessing is a subset of data processing. Data processing
encompasses a broader range of activities that involve the
manipulation, transformation, and analysis of data. Data preprocessing
specifically focuses on preparing data for analysis by cleaning and
organizing it.


* What is a machine learning model?

* A machine learning model is a mathematical representation of patterns
and relationships in data. It is created by training an algorithm on a
dataset to learn from it and make predictions or decisions. The model
can then be used to make predictions on new, unseen data.


Overview 


* Data scientists excel at creating models that represent and predict
real-world data, but effectively deploying machine learning models
is more of an art than a science. Deployment requires skills more
commonly found in software engineering and DevOps.


* The goal of building a machine learning model is to solve a problem,
and a machine learning model can only do so when it is in
production and actively in use by consumers. As such, model
deployment is as important as model building.


Data Science Pipelines 


* Data science pipelines are sequences of interconnected data
processing steps. These pipelines transform raw data into actionable
insights, ensuring consistency and efficiency in the analysis process.

* The main components of a data science pipeline include data
collection, preprocessing, feature engineering, model selection,
training, and deployment.


Building blocks for a Data Science Pipeline


Advantages of a DS Pipeline


Data science pipelines:

* enhance collaboration among team members,

* reduce manual errors,

* and facilitate tracking of changes, ultimately leading to improved
data quality and informed decision-making.


Challenges of a DS Pipeline 


* dealing with missing data,

* data quality issues,

* and complexities in feature engineering.

Overcoming these challenges is crucial for maintaining pipeline
integrity.


Machine Learning Pipelines 


* Machine learning pipelines are structured frameworks that guide
the process of building, training, and deploying machine learning
models.

* These pipelines ensure consistency and reproducibility.

* The building blocks in an ML pipeline include data preprocessing,
feature scaling, model selection, hyperparameter tuning,
cross-validation, model training, and deployment.


Building blocks for a Machine Learning Project


Advantages of an ML Pipeline


Machine learning pipelines: 

* automate the model development process,

* enable faster iteration,

* and improve model performance, leading to more efficient and
effective machine learning projects.


Challenges of an ML Pipeline 


* handling varying data formats,
* striking the balance between automation and customization,
* and selecting suitable models and hyperparameters.


From model to production


* Many teams embark on machine learning projects without a
production plan, an approach that often leads to serious problems
when it's time to deploy. It is both expensive and time-consuming to
create models, and you should not invest in an ML project if you
have no plan to put it in production, except of course when doing
pure research. With a plan in hand, you won't be surprised by any
pitfalls that could derail your launch.


DS & ML Pipelines


* Data science and machine learning pipelines are essential tools in
modern data-driven projects. They enable organizations to
effectively manage and streamline the processes involved in data
preparation, modeling, and deployment.


From model to production 


There are three key areas to consider before embarking on any ML
project:


Data storage and retrieval 
Frameworks and tooling 
Feedback and iteration 


Software Orchestration: Ideal Scenario


Software Orchestration: Realistic standpoint


Class Exercise


1. The objective of this exercise is to build an MLOps stack using the
tools available on MyMLOps.

2. Visit MyMLOps Builder and select the tools you want to include in
your stack.

3. Research each tool you select and briefly summarize what each
tool in your stack does.

4. Once you have selected the tools, click on "Build Stack" to generate
a template.

5. Take a screenshot of your template and upload your files to
Moodle.


Data storage and retrieval 


A machine learning model is of no use to anyone if it doesn't have 
any data associated with it. You'll likely have training, evaluation, 
testing, and even prediction data sets. You need to answer 
questions like: 


How is your training data stored? 

How large is your data? 

How will you retrieve the data for training? 

How will you retrieve data for prediction? 

These questions are important as they will guide you on what 
frameworks or tools to use, how to approach your problem, and 
how to design your ML model. 


Data storage and retrieval 


* Data can be stored on-premises, in cloud storage, or in a hybrid of
the two. It makes sense to store your data where the model training
will occur and the results will be served: on-premises model training
and serving are best suited to on-premises data, especially if the
data is large, while data stored in cloud storage systems like GCS,
AWS S3, or Azure storage should be matched with cloud ML training
and serving.


Data storage and retrieval 


* Even if you have your training data stored together with the model
to be trained, you still need to consider how that data will be
retrieved and processed. Here the question of batch vs. real-time
data retrieval comes to mind, and this has to be considered before
designing the ML system. Batch data retrieval means that data is
retrieved in chunks from a storage system while real-time data
retrieval means that data is retrieved as soon as it is available.
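
The difference between the two retrieval styles can be sketched in plain Python. This is only an illustration under stated assumptions: `stored_records` stands in for a storage system or event stream, and the function names `read_in_batches` and `process_as_they_arrive` are hypothetical, not part of any library.

```python
from typing import Callable, Iterable, Iterator, List


def read_in_batches(records: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Batch retrieval: yield records in fixed-size chunks."""
    batch: List[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, chunk
        yield batch


def process_as_they_arrive(
    records: Iterable[dict], handler: Callable[[dict], None]
) -> None:
    """Real-time style retrieval: handle each record as soon as it is available."""
    for record in records:
        handler(record)


# Hypothetical stand-in for a storage system or event stream.
stored_records = [{"id": i} for i in range(5)]

# Batch: five records arrive as chunks of at most two.
batch_sizes = [len(b) for b in read_in_batches(stored_records, batch_size=2)]

# Real-time: every record is handled individually, in arrival order.
seen_ids: List[int] = []
process_as_they_arrive(stored_records, lambda r: seen_ids.append(r["id"]))

print(batch_sizes)
print(seen_ids)
```

In a real system the batch reader would typically wrap a database query or file scan, and the per-record handler would be attached to a message queue or API endpoint.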


Data storage and retrieval 


* Along with training data retrieval, you will also need to think about
prediction data retrieval. Your prediction data is rarely as neatly
packaged as the training data, so you need to consider a few more
issues related to how your model will receive data at inference time:


* Are you getting inference data from webpages?
* Are you receiving prediction requests from APIs?
* Are you making batch or real-time predictions?


Frameworks and tooling 


* Your model isn't going to train, run, and deploy itself. For that, you
need frameworks and tooling, software and hardware that help you
effectively deploy ML models. These can be frameworks like
TensorFlow, PyTorch, and scikit-learn for training models,
programming languages like Python, Java, and Go, and even cloud
environments like AWS, GCP, and Azure.


* After examining and planning your use of data, the next step is to
consider which combination of frameworks and tools
to use.


Frameworks and tooling 


* The choice of framework is very important, as it can determine the
continuity, maintenance, and use of a model. In this step, you must
answer the following questions:


What is the best tool for the task at hand?

Are the tools of choice open-source or closed-source?

How many platforms/targets support the tool?

To help determine the best tool for the task, you should research 
and compare findings for different tools that perform the same job. 
For instance, you can compare these tools based on criteria like: 


Frameworks and tooling 


* Efficiency: How efficient is the framework or tool in production? A
framework or tool is efficient if it optimally uses resources like
memory, CPU, or time. It is important to consider the efficiency of
the frameworks or tools you intend to use because they have a direct
effect on project performance, reliability, and stability.


* Popularity: How popular is the tool in the developer community?
Popularity often means it works well, is actively in use, and has a lot
of support. It is also worth mentioning that there may be newer
tools that are less popular but more efficient than popular ones,
especially for closed-source, proprietary tools. You'll need to weigh
that when picking a proprietary tool to use.


Frameworks and tooling 


* Support: How good is support for the framework or tool? Does it have a
vibrant community behind it if it is open-source, or good vendor
support if it is closed-source? How fast can you find tips,
tricks, tutorials, and other use cases in actual projects?


* Next, you also need to know whether the tools or frameworks you
have selected are open-source or not. There are pros and cons to this,
and the answer will depend on things like budget, support,
continuity, community, and so on. Sometimes, you can get a
proprietary build of open-source software, which means you get the
benefits of open source plus premium support.
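
The efficiency criterion (time and memory use) can be checked empirically before committing to a tool. The sketch below uses only the Python standard library. The two candidates are hypothetical stand-ins for "different tools that perform the same job", and `measure` is an illustrative helper, not an established benchmark harness.

```python
import timeit
import tracemalloc
from typing import Callable, Dict


def measure(func: Callable[[], object], repeats: int = 5) -> Dict[str, float]:
    """Return the best wall-clock time (seconds) and peak memory (bytes) for func."""
    best_time = min(timeit.repeat(func, number=1, repeat=repeats))
    tracemalloc.start()
    func()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"seconds": best_time, "peak_bytes": float(peak)}


# Two hypothetical candidates that perform the same job: summing squares.
def candidate_list() -> int:
    return sum([i * i for i in range(100_000)])  # builds the full list first


def candidate_generator() -> int:
    return sum(i * i for i in range(100_000))  # streams values one at a time


results = {
    "list": measure(candidate_list),
    "generator": measure(candidate_generator),
}
for name, stats in results.items():
    print(f"{name}: {stats['seconds']:.4f} s, peak {stats['peak_bytes']:.0f} bytes")
```

Timings vary between machines and runs, so measurements like these are best treated as rough comparisons rather than fixed figures.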


Frameworks and tooling 


* These tools enhance efficiency and productivity in pipeline
development.

Popular tools for building data science pipelines:
pandas for data manipulation and Airflow for workflow automation.


Popular tools for building machine learning pipelines:

libraries like sklearn.pipeline for pipeline creation, GridSearchCV for
hyperparameter tuning, and MLflow for model lifecycle
management.
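
As a rough sketch of how `sklearn.pipeline` and `GridSearchCV` fit together, the example below chains feature scaling and a classifier, then tunes one hyperparameter with cross-validation on the Iris data set bundled with scikit-learn. The choice of logistic regression, the grid values, and the 75/25 split are assumptions made for illustration. MLflow tracking is left out to keep the sketch short.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load the Iris data set bundled with scikit-learn (no download needed).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Chain preprocessing (feature scaling) and the model into a single estimator.
pipeline = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ]
)

# Hyperparameter tuning with 5-fold cross-validation.
# The grid values are illustrative assumptions, not recommendations.
search = GridSearchCV(
    pipeline,
    param_grid={"model__C": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Evaluate the best pipeline found on data it has not seen during tuning.
test_accuracy = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print(f"Held-out accuracy: {test_accuracy:.3f}")
```

Because the scaler sits inside the pipeline, it is refit on each cross-validation training fold, which keeps information from the validation fold out of the preprocessing step.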


Data Science Pipeline


A data science pipeline encompasses a broader range of activities 
beyond machine learning. It focuses on the end-to-end process of 
extracting insights and knowledge from raw data. The stages 
involved in a data science pipeline typically include: 

Data Collection 

Data Cleaning and Preprocessing 

Exploratory Data Analysis (EDA) 

Feature Engineering 

Model Development 

Visualization and Reporting 

Decision Making 


Machine Learning Pipeline


A machine learning pipeline specifically revolves around the process of
building, training, evaluating, and deploying machine learning models.
The stages in a machine learning pipeline typically include:


* Data Collection

* Data Cleaning and Preprocessing

* Feature Engineering

* Model Selection

* Model Training and Tuning

* Model Evaluation

* Model Deployment

* Automation and Monitoring


Class Exercise 


Join at menti.com and use code 3775 5942


Instructions 


www.menti.com 


Enter the code 


3775 5942 


1. Iris Data Science Pipeline


Active Review


1. What is the key thing you 
remember from today's lesson? 


2. What are you looking forward to in
the next class?


Class Exercise


1. Iris Data Science 
Pipeline 

2. Upload the Housing
Dataset in Google
Colab and run the code


LearnAI - Machine Learning on Azure
Seth Mottaghinejad
Wolfgang Pauli, PhD


Resources for this Airlift 


Azure Subscriptions 
https://aka.ms/learnAlsubscriptions 


Azure Databricks Notebooks 
https://aka.ms/learnAlnotebooks.dbc 


Git Repository for LearnAI CustomAI Partner Airlift


https://github.com/azure/learnai azure ml 


